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Abstract 

In this paper we develop a statistical theory and an implementation of deep 
learning (DL) models. We show that an elegant variable splitting scheme 
for the alternating direction method of multipliers (ADMM) optimises a deep 
learning objective. We allow for non-smooth non-convex regularisation penal¬ 
ties to induce sparsity in parameter weights. We provide a link between tradi¬ 
tional shallow layer statistical models such as principal component and sliced 
inverse regression and deep layer models. We also define the degrees of free¬ 
dom of a deep learning predictor and a predictive MSE criteria to perform 
model selection for comparing architecture designs. We focus on deep multi¬ 
class logistic learning although our methods apply more generally. Our results 
suggest an interesting and previously under-exploited relationship between 
deep learning and proximal splitting techniques. To illustrate our methodol¬ 
ogy, we provide a multi-class logit classification analysis of Fisher's Iris data 
where we illustrate the convergence of our algorithm. Finally, we conclude 
with directions for future research. 
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1 Introduction 


Deep Learning (DL) provides a powerful tool for high dimensional data reduction. 
Many areas of applications in predictive modeling occur in artificial intelligence 
and machine learning; including pattern recognition Ripley [1996]; computer vi¬ 
sion Dean et al. [2012]; image segmentation and scene parsing Farabet et al. [2013]; 
predictive diagnostics; intelligent gaming Mnih et al. [2013]. The salient feature 
of a deep learning model is a predictive rule comprised by a layered composition 
of link or activation functions. Deep architectures with at least three layers have 
been shown to provide improved predictive performance compared to traditional 
shallow architectures in a number of applications. The challenge for deep learning 
methodologies, however, are computational: the objective function which mea¬ 
sures model fit is typically highly multi-modal and hard to optimise efficiently. 

We build on the extensive deep learning literature by showing that proximal 
Bayesian optimisation techniques provide a turn-key solution to estimation and 
optimisation of such models and for calculating a regularisation path. We allow for 
the possibility of irregular non-convex regularisation penalties to induce sparsity 
in the deep layer weights. Proximal operators and the alternating direction method 
of multipliers (ADMM) are the key tools for implementation. This approach sim¬ 
ply re-writes the unconstrained optimisation as a constrained one, with a carefully 
constructed sequence of auxiliary variables and envelopes, to deal with the asso¬ 
ciated augmented Lagrangian; see Parikh and Boyd [2014], Poison et al. [2015], 
Green et al. [2015] for recent surveys. Proximal algorithms have achieved great 
success and provide a methodology for incorporating irregular non-differentiable 
penalties; see [Masci et al., 2013] for applications in the areas of computer vision 
and signal processing. 

From a statistical perspective, DL models can be viewed as generalised lin¬ 
ear models (GLM Davison [2003], Dellaportas and Smith [1993]) with recursively 
defined link functions. Traditional statistical models commonly use shallow net¬ 
works containing at most two layers or hierarchies. For example, reduced rank 
regression can be viewed as a deep learning model with only two layers and linear 
links. Support vector machines for classification Poison and Scott [2013] use a pre¬ 
dictive rule based on a rectified linear (or hinge) link. Recent empirical evidence, 
however, suggests improved statistical predictive performance with deep archi¬ 
tectures of at least three layers. Our focus here will be on developing fast learning 
methods tor deep multi-class logistic models although our methods apply more 
generally to recurrent and convolutional neural nets. Although well known in the 
Neural Network and Statistics literature [Knowles and Minka, 2011], efficient es¬ 
timation of the cross-entropy/multinomial loss with regularization-outside of the 
^^-ridge penalty-has not been a mainstream area of research. See Madigan et al. 
[2005] and Genkin et al. [2007] for applications to large scale multinomial logistic 
models. In general, the -ridge penalty is commonplace [Poggio and Girosi, 1990, 
Orr, 1995], mostly due to its differentiability. Our methods can therefore be seen 
as related to sparse Bayes MAP techniques; see Titterington [2004], Windle et al. 
[2013], Poison et al. [2015] for high dimensional data reduction. 
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Mainstream estimation within Deep Learning broadly revolves around gradi¬ 
ent descent methods. The main variation in techniques arises from general con¬ 
siderations for computational complexity and introduce more tuning parameters 
[Ngiam et al., 2011]. Such considerations are the basis for Stochastic Gradient De¬ 
scent (SGD) and back-propagation. For instance, back-propagation uses the chain 
rule for the derivative of the composite of activation functions. This can reduce 
the order-of-operations dramatically from naive direct evaluation while maintain¬ 
ing high numerical accuracy; however, this says little about the general difficultly 
in estimation of a non-linear objective function. Direct gradient methods can be 
poorly scaled for the estimation of deep layer weights, in contrast to our proximal 
splitting approach which overcomes this by providing a simultaneous block up¬ 
date of parameters at all layers. The largest networks (e.g. Dean et al. [2012]) are 
currently trained using asynchronous SGD. Farabet et al. [2013] discusses hard¬ 
ware approaches to faster algorithms. Providing training methodologies is a very 
active field of research. 

The splitting techniques common in the proximal framework do exist in the Al 
literature but we believe that their broad applicability and functional coverage has 
mostly been overlooked and under-exploited. This is possibly due to the afore¬ 
mentioned concerns of computational complexity that makes stochastic gradient 
descent (SGD) and back-propagation methods so popular, although it's quite pos¬ 
sible that the number of iterations to step-complexity can favor (parallel) proximal 
methods in some cases. We show that our augmented ADMM approach is embar¬ 
rassingly parallel with block updates for parameters and auxiliary variables being 
directly available due to our carefully chosen splitting procedure. 

Traditional approaches to deep learning use Back-propagation LeGun et al. 
[2012], Hinton et al. [2006], Hinton and Salakhutdinov [2006] which rely on the 
chain rule for the derivatives of the composite of the L layers in the network. We 
propose a proximal splitting approach which also lends itself to the inclusion of a 
non-differentiable regularisation penalty term so as to induce sparsity. Gombettes 
and Pesquet [2011] detail multiple splitting methods in the context of proximal 
operators as well as parallel proximal algorithms that handle many splitting vari¬ 
ables. Wang and Garreira-Perpinan [2012] perform splitting in a similar context 
as ours, but with an & loss and quadratic barrier approach to handle the equal¬ 
ity constraints. In contrast, we generalize to non-^^ and detail a general enve¬ 
lope approach that includes the well known augmented Lagrangian. Our general 
envelope approach allows one to utilize efficient bounds, such as those found in 
Bouchard [2007] and Knowles and Minka [2011]. 

The rest of the paper is outlined as follows. Section 2 introduces deep learning 
predictors. We show that standard statistical models such as principal components 
analysis, reduced rank and sliced inverse regression can be viewed as shallow ar¬ 
chitecture models. We also motivate the addition of further layers to aid in the reg¬ 
ularised prediction problem. We illustrate how proximal splitting methods solve 
a simple shallow architecture model. Section 3 describes the general deep learn¬ 
ing problem and its solution via proximal splitting methods. Supervised learning 
uses a training dataset to estimate the parameters of each layer of the network. We 
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define the degrees of freedom of a deep learning predictive rule which quantifies 
the trade-off between model fit and predictive performance. Section 5 provides a 
comparison of DL and Neural Network (NN) models in a multi-class deep logistic 
classification model for Fisher's Iris data. We also illustrate the convergence of our 
algorithm. Finally, Section 6 concludes with directions for future research. 


2 Deep Learning Predictors 

Let y G where S = IR^ tor regression and S = {1, ... ,K} for classification. Here 
y denotes an observed output associated with a high dimensional input/covariate 
variable given hy X = {X;} and x, G IR^. The generic problem is to find a 
non-linear predictor of the output y{X,W) where W are parameters which will be 
estimated from a training dataset of N outputs and inputs, denoted by {yi, X, 
When the dimensionality of X is high, a common approach is to use a data 
reduction technique. This is achieved by introducing a low-dimensional auxiliary 
variable z G where d M and constructing a prediction rule specified by a 
composition of functions. 


y = h(f2(x)) 

= /i(z) where z = fiiX). 

Now the problem of high dimensional data reduction is to find the z-variables 
using training data and to estimate the layer functions {fi, f 2 ) ■ One reason for the 
success of deep learning is the regularisation achieved through the hidden layer 
low-dimensional z variable. A hallmark of deep learning models is also the use 
of nonlinear layers. As such, one can view them as hierarchical nonlinear factor 
models or more specifically as generalised linear models (GLM) with recursively 
defined nonlinear link functions. 

The key to a layering approach is to uncover the low-dimensional z-structure 
in a way that doesn't disregard information about predicting the output y. We will 
show that our splitting scheme naturally uses the hidden layer z-variables. 

For example. Principal component analysis (PCA) reduces X using a SVD de¬ 
composition Wold [1956]. This type of dimension reduction is independent of y 
and can easily discard X information that is valuable for predicting the output. 
Sliced inverse regression (SIR) Cook and Lee [1999], Cook [2007] overcomes this by 
estimating the layer function / 2 , independently of /i, using data on both {y,X). 
Deep learning takes this one step further and jointly estimates fi and /2 in tandem 
using training data on both pairs {y,X). 

A common approach is to introduce parameters, W, at each layer by convo¬ 
lution with linear maps which we denote by /i(Wiz) and assuming cen¬ 

tered variables. Statistical models are traditionally shallow architectures with at 
most two layers. Reduced rank regression corresponds to linear link functions with 
/i(z) = Wiz where z = fiix) = W 2 X. The dimensionality reduction then becomes 
an eigen-problem for Wi and W 2 . Radial basis functions/kernel sliced inverse regression 
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uses fi (z) = W0(z) where 0 is a set of kernels/basis functions. In many cases, we 
will also add a penalty, </>( W), for parameter estimation. The term (p(W) is a regu¬ 
larization term that imposes structure or effects a favorable bias-variance trade-off 
in prediction. We need to be able to account for non-smooth penalties such as lasso 
(p{yV) = 7 IW /1 or bridge penalty to induce sparsity, where 7 > 0 is a scaling 
parameter that traces out a full regularisation path. 

A deep learning predictor takes the form of an L-layered convolution rule of 
link functions where 

y = /i(- . ./l-i(zl) • • •) with zl = /l(A’) . 

We define the set of layer z-variables are given by Z/ = W//fc(z;_|_i) assuming that 
the variables are centered. 

The original motivation for recursive deep learning predictors resides in the 
seminal work of Kolmogorov [1957], Lorenz [1963] and the extensions derived by 
Barron [1993], Poggio and Girosi [1990] and Bach [2014]. See Paige and Butler 
[2001] for a Bayesian approach to neural networks. The motivation behind lay¬ 
ered convolution maps lies in the following completeness result. Given a Gaus¬ 
sian nonparametric regression with any continuous function, f (x) on [ 0 , 1 ]^ and 
e ^ N(0,7) Poggio and Girosi [1990], there exists one dimensional link functions 
fi and such that 

lM+\ / M 

y = F{X)+(Te where F(A’) = £ 

1=1 \m=l 

This is a four layer network with a large number of units at the first layer. On the 
practical side, one may wish to build a deep network with more layers but less 
units at each stage. Section 4.2 provides a model selection approach to architecture 
design. 

Neural network models can simply be viewed as projection pursuit regression 
F(x) = Y}tn=i fi{F^ix) with the only difference being that in a neural network the 
nonlinear link functions, //, are parameter dependent and learned from training 
data. For example, a two-layer auto-encoder model with a sigmoid link function 
uses a prediction rule given by functions of the form Yli Wi / • a + bm,o)- 

The fi's are learned via the sigmoid function of the linear combination. 

The modeling intuition a deep learning architecture can be found within a two 
layer system. Suppose that we have a loss problem with a two layer rule 

rmnlly — y(T’, W) 11 ^ where y = Wi/i(W 2 / 2 (<T’)) . 

This objective can be highly nonlinear. Finding the optimal parameters is chal¬ 
lenging as traditional convex optimisation methods deal only with sums rather 
than composites of objective functions. Our approach will be based on augmented 
ADMM methods which still apply by splitting on the variables Zi = Wi/i (Z 2 ) and 
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Z 2 = Bertsekas [1976] provides a general discussion of ADMM methods. 

Now we need to solve an augmented Lagrangian problem of the form 

max mtn£(yV, Z,1C) 

/c w,z 

with IC = {ki,K 2 } and Z = {zi,Z 2 } where 

N 

c{w,zx) = Y. 

i=\ 

+ N ,2 ( Z ;,2 “ “ ^ 2 / 2 (^;)||^| 

for augmentation parameters 

Re-expressing in scaled Lagrangian form with y = k, j /fii j, j G {1/2}, gives 

N 

c{z,wx) = Y. 

i=\ 

+ ^||zu + - Wi/i(z,-2) 

+ ^^l|Zi,2 + 2 ^z,2 “ 

If we add an £^-norm penalty for W, we obtain ridge regressions steps in block 
W updates. The scaled Lagrangian saddle-point is solved via the iterative ADMM 
scheme Poison et al. [2015]. This solves the optimisation of a recursively defined set 
of link functions rather than the traditional sum of convex functions. See Lewis and 
Wright [2008] for convergence analysis of these layered optimisation problems. 

An important feature of our ADMM update for the parameters W = (Wi, W 2 ) 
is that it happens in tandem. Moreover, the regularisation steps are properly scaled 
by the current values of the functions of the augmented z-variables. On the other 
hand, a direct gradient-based approach such as back-propagation uses first order 
information based on the composite derivative (/i o f 2 )'■ This can easily lead to 
steps for the second layer parameters, namely W 2 , that are poorly scaled. Back- 
propagation can be slow to converge as it makes small zig-zag steps in the highly 
multi-modal objective. More efficient second order Hessian methods would re¬ 
quire more information and computation. Martens and Sutskever [2011] develops 
a Hessian-free Lagrangian approach that is based on an envelope that uses the 
derivative of the composite map I o f where I denotes the model measure of fit and 
/ a layer function. 

Whilst our augmented Lagrangian uses an -barrier, we can use general met¬ 
rics and still achieve convergence Bertsekas [1976]. A computationally attractive 
approach would be to match the type of ADMM barrier with the nonlinearity in 
the link function. We leave this for future research. 

We now turn to our Deep Bayes learning framework. Bayes provides a way of 


H2 

2 


u 


i, 2 \ 



\yi-Zi,lf + Hi (hi - ^ifliZi,!)) + - Wi/i(z,;2)f 
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Table 1: Link (activation) functions. Typical Lp-norms are p = 1 (lasso), p = 2 
(ridge) or oo (max-norm). RectLU/hinge norm is related to lasso via max(M, 0) = 
j{\u\ + u), see Poison et al. [2011]. Max-pooling is the sum of a hinge and lasso 
norms, max(|Mi|, \u 2 \) = max(|Mi| — |m 2 |/ 0 ) -|- \u 2 \. 


linear 

sigmoid 

softmax 

tanh 

log-sum-exp 

^P-norm 
rectLU/hinge 
max-pooling 


_ fkju) 

Au + b 
(1 e“)“^ 

eVLti 
2(l + e“)-i-l 
logE/e"‘ 

iL.Wiin' 

max(0, u) 
max{|u/J, \ ui.\} 


unifying the stochastic and algorithmic approaches to data Breiman [2001]. 


3 Deep Bayes Learning 

A deep learning predictor, y(x, W) G R^, depends recursively on a set of link 
functions, fkA Ak < L, defined by 

yi := y(x, W) = Wi/i {mfii- • • (/l-i(WlX, + & 0 ) • ■ •) + ^ 2 ) • 

The link or activation functions // : > R^' are pre-specified and part of the 

architecture design. Specific choices includes sigmoidal; softmax; tanh; rectLU or log- 
sum-exp see Table 1. We use L to denote the number of layers in the network. The 
parameters b; G RX, with 1 < I < L and bi = 0, are off-sets or bias parameters. 
The parameters W can be decomposed as 

W = (Wi,..., Wl) for Wi G for 1 < / < L and Nl = M, 

To construct a posterior distribution, p( W|y), we use a prior distribution, p(>V) oc 
exp {—(p{yV)), where ^(W) is a regularisation penalty. 

Bayes rule yields the posterior distribution over parameters given data, namely 

p(W\X,y)^p(!)\^(X,W))p(W) 

ixexp{-\o^p(tl\g(X,W))-(tiiW)) . 

The deep learning estimator, W, is a Bayes MAP estimator and leads to a pre- 
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diction rule 


y{X,'W) where W := argmax p(W|d::’,i/) . 

w 

Maximising the posterior distribution corresponds to finding the argmin of the 
deep learning objective function. Taking a Bayesian perspective allows the re¬ 
searcher to characterise the full posterior distribution, p{W\y), typically, via sim¬ 
ulation methods such as MCMC. This can aid in finding parameter uncertainty 
estimates and in determining efficient prediction rules. 

The log-likelihood can also be interpreted as gauging the accuracy of a pre¬ 
diction, y, of outcome, y. The underlying probability model, denoted by p{y\y), 
determined this fit via 

Ky>y) = -logp(yly) . 

The function l{y,y) is typically a smooth function such as an £^-norm ||y — y|| 
or a cross-entropy metric l{y,y) = ylogy when dealing with classification. The 
statistical interpretation is as a negative log-likelihood, in machine learning a cost 
functional. The major difficulty in implementing deep learning models is the high 
degree of nonlinearity induced by the predictor, y(W, X), in the likelihood. 

We will pay particular attention to the multi-class deep logistic learning model. 
The first layer is given by (r( Wx) = which defines the vector sigmoid 

function (j{x). 

Example 1 (Deep Logistic Learning). Suppose that our observations yt represent a 
multi-class 1-of-K indicator vector, which we equate with class k via y = kfor 1 <k < K. 
Given a set of deep layer link functions with zi = ff-), we have a predictive rule 

y{z,yV) = p{y = k\z,W) = cri,{Wiz) . 

The negative log likelihood is given by 


K 

Kyuyi) =iogp(ydyi) = -iogn(ya)^''*^ 

k=\ 

K 

= - Ey^^ogya • 

k=l 

Minimising the cross-entropy cost functional is therefore equivalent to a multi-class logis¬ 
tic likelihood function. Genkin et al. [2007] and Madigan et al. [2005] provide analysis of 
large-scale multinomial logit models with shallow architectures. 

The basic deep learning problem is supervised learning with a training dataset 
{yi, We find the deep network structure via an optimisation of the following 

form 

argmin {/(y,, y(x,, W)) +(/)(W) +(/)(2:)} 

W (1) 

where y{x„W) = Wi/i (W 2 / 2 (. • • (/l-i(WlX, + 1 ,^))...) + 1 , 2 ) 

Again l{y,y) is a measure of fit depending implicitly on some observed data y 
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and a prediction rule y(W, A’). Here y denotes an N-vector of outcomes and X a 
corresponding set of N-many M-vector characteristics; for example, pixels in an 
image or token counts in a topic model for document classification. 

To solve the deep learning objective function with non-differentiable regular- 
isation penalties we use an auxiliary latent variable scheme in the context of an 
augmented Lagrangian. This allows us to write the deep learning optimisation as 
a constrained problem 


N 

argmin + (p{yV) + (p{Z) 

i=i 

where Z/ = W;//(z;+i) + bi 1 < / < L 
zl = WiXi + hi 


( 2 ) 


Through variable splitting we can introduce latent auxiliary variables, Z/ G R^', 
where//(z/_|_i) : R^'+i —^ R^'+G Notice that we also allow for regularisation of the 
Z variables so as to include sparsity 

Our approach follows the alternating pattern of ADMM and Douglas-Rachford 
type algorithms, which for a split objective given by 

min/(w) + g{z) where Aw — Bz = 0 
performs the following iterations 

_ argmin 

W 

2 (^+ 1 ) = argmin 

Z 


/(w) + |||z(;-(zW-uW) 
g{z) + ^ z - + 




for some y > 0. 

Roughly speaking, variable splitting enables one to formulate an iterative solu¬ 
tion to the original problem that consists of simple first-order steps, or independent 
second-order steps, since it "decouples" the functions, which in our case are the 
penalty, layers, and loss functions. Now, potentially simpler minimization steps 
on the splitting variables and observation are combined, often in a simple linear 
way, to maintain the "coupling" that exists in the original problem. In such a set¬ 
ting, one can craft sub-problems-through the choices of possible splittings-that 
have good conditioning or that suit the computational platform and restrictions, 
all without necessarily excluding the use of standard and advanced optimization 
techniques that may have applied to the problem in its original form. 

For instance, the same Newton-steps and back-propagation techniques can be 
applied on a per-layer basis. For that matter, one can choose to split at the lowest 
layer (i.e. between the loss function and layers above) and allow fi to be a compli¬ 
cated composite function. This separates the problem of minimizing a potentially 
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complicated loss from a composition of non-linear functions. In general, steps will 
still need to be taken in the composite /i, but again, currently established methods 
can be applied. 

Since splitting can occur at any level, a natural question is at which layer should 
the splitting occur. If there was a penalty function across splitting variables, then 
standard regularization paths could be computed in fairly low dimension (min¬ 
imum 2, for a weight on W and Z). This provides a potentially smooth scaling 
across the layer dimension and/or number of layers (see the regressor selection 
problems in [Boyd et al., 2011, Section 9.1.1]). By simply re-applying the strict 
equality constraints between splitting variables in some layers, e.g. through an in¬ 
dicator penalty (p{Z), one could effectively emulate certain model designs; thus, 
through a structured approach to regularizing these terms one can perform a type 
of model selection. This approach may also explain-and emulate-the effect of 
drop-out in a non-stochastic way. 

The recursive nature of defining layers naturally leads to the construction of z- 
variables and in Section 4 and 4.1 we show how this provides an under-exploited 
relationship between these methods and deep learning. 


4 Proximal Splitting 

Let's start by considering the augmented Lagrangian, C{Z, W, K.), which takes the 
following general form for observations {{yi, 

N r 

+ ^7,1 (zu - -bi) + ^\\zi,i - Wi/i(z,, 2 ) -bif 

L—1 

+ E (^7 - bi) + ^||z;y - Wifi{z^+i) -bif) 

1=2 z 

+ ^7l {Zi,L - y^LXi - bi) + ^||z;,L - WlX; - | 

(3) 

where k, / G and k, / G JC are Lagrange multipliers and yi i G 1R+. The form 
here does not easily highlight the role of each term across observations and layers, 
nor the relationship between terms. As a result, in what follows, we recast the 
Lagrangian into forms that clarify the relationship. 

We make extensive use of vectorization, so many of the result are obtained by 
using the following identities 

vec(AB) = {I Z> A) vec(B) = (B””" (g) I) vec(A) 
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which, in our case, give 


vec{Wifi{Zi+i)) = W/)//(z/+i) = ^ I^)wi 

where ivi = vec(W/), for the stacked by observation terms 

Z/ = (zi; ... z^j) G and Z/ = vec(Z;) G R^''^ 

X = (xi... Xn) G R^^^ and = (^i^/... ^ IR+ • 

We also extend W/ and //(Z;y+i) with N; = N/ + 1 and 

w, = (b, Wi) € rn^xN'+i, f,(zi,i+i) = e 

w, = vec(Wi) € E«rN,+i, /,{Z,+i) = e 


which includes /L(zi,L+i) •= ^i- This means that the bi terms are members of the 
primal variable set W. 

We introduce the scaled Lagrange multipliers 


u 


T 

1 


^ e rN/-n _ 

Fi,i Fn,i ) 


We then obtain 


max min 
u w,z 


(piW) + L{y,zi) + J^h\zi 
1=1 ^ 



(In® W/)/;(z/+i) - (In ^ bi) + ui\\l 

n 


n]axmin|(/)(W) + L(y,zi) + J^-\\zi 



{In^Wi) fi{zi+i) + ui\\l^^ 


(4) 


where Jjv is the N x N identity matrix and 

N 

= Diag(y; = 0 e R^ 

i=l 


II 112 ~r 

iimer-product norm ||x||y^ := x Ax and with ^ denoting the Kronecker product 
and 0 the direct sum. is a block diagonal, so it can be factored in "square root" 
fashion. 
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Naturally, operations across layers I can also be vectorized. First, let = 
N Ln=l N, = N En=l and 


= (Z]'. .,z/ ) G R 




T 


)Nz 


U ' = ( U-r' . .,u 7 ) G R 


)Nz 


A/r X JVr 


n—1 


The linear first-order difference maps are defined by 

rNz 

/O (7n®Wi) 


ArT, : R^^ ^ R^^ 


Aw '■— Inz 

Vo 

= I - Qwf 


0 

0 


0 

[In 0 W 2 ) 0 

0 


0 \ 


/ 


0 (7n0^l)/ 


where the 0 matrices match in dimension, Qw G R^^ x(N.+N.M) ^nd 

T 


/oz := (^zJJi[zJ)z...jL-i[zl)zVec[X)^ 

Now, with Piz = Zi, (4) becomes 

f 1 2 1 2 

maxmin < d{iv) + L{y,Piz) + -\\AwZ + w|L — -||w||, 
M W,Z 2 I* 2 

In terms of w, our problem is 

r 1 2 1 

maxmin < d{iv) + L{y,Piz) + -\\Aziv — z — u|L — -\\u 
U W,Z [ 2 r‘ 2 


|2 

Ia„ 


(5) 


( 6 ) 


(7) 


where 

:= © (/„(Z,T+l) » Jn.) e . (8) 

n=l 

Note the relationship between the two operators, i.e. 

(7 + 2\h;)z = AzW . 

Equations (6) and (7) provides a simple form thaf shows how our problem dif¬ 
fers from the problems commonly considered in the basic operator splitting liter¬ 
ature, in which ADMM, Douglas-Rachford and the inexact-Uzawa techniques are 
developed. In these cases the general constraint is usually given as Aw + Bz = c, 
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for linear operators A and B. Our problem involves an operator, that intro¬ 
duces a multiplicative relationship between the primal variable w and the dual z. 
This form could be interpreted as-or related to-a bi-convex problem (for convex 
/;, naturally), especially when An) is bi-affine Boyd et al. [2011]. 

4.1 Proximal Operators and ADMM 

The proximal operator is defined by 

prox(x) := argmin|g(z) + h\z-x\\\ 

Ag z I z 

for positive definite A. Normally, A = AI and the operator reduces to 

prox(x) := argmin 

z 

When A is diagonal and positive, such as above, one could simply use its in¬ 
verse to rescale the terms in the problem (effectively g{z) g{A~^^^z) := g{z) 
and X however, we use the matrix inner-product norm for notational 

convenience and the implication of wider applicability 

The proximal operator enjoys many convenient properties, especially for lower 
semi-continuous, convex g, and has strong coimections with numerous optimiza¬ 
tion techniques and fixed-point methods [Boyd and Vandenberghe, 2009, Com- 
bettes and Pesquet, 2011]. 

The form of (4) in Z/ reflects the definition of the proximal operator, and, after 
some manipulation, the same is true for W;. This allows us to apply the iterative 
techniques surrounding proximal operators in what follows. 

One advantage of our approach is its embarrassingly parallel nature. The aug¬ 
mented Lagrangian leads to a block update of {yV\Z,lA). For example, if we add 
the traditional £^-ridge penalty we have a Bayes ridge regression update for the 
block (Wi,..., Wl). Proximal algorithms can also be used for non-smooth penal¬ 
ties. Our method is also directly scalable and the vectorization of our method 
allows implementation in large scale tensor libraries such as Theano or Torch. 

1 . given 

The problem, in vectorized form, is 

argmin IL(i/,Piz) + ^\\A^zAu\\\^^ 

which, given the design of A a), is not a simple proximal operator. Even if 
Ay, resulted in a simple diagonal matrix, the proximal operator of L(y,Piz) 
may not be easy to evaluate. Regardless, if this minimum is found using 
proximal approaches, it would likely be through another phase of splitting. 
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be it ADMM for the subproblem, or forward-backward/proximal gradient 
iterations. 

1 2 

To illustrate the forward-backward approach we let f (z) = jWAnjZ + 
and note that the Jacobian matrix, D f (z), is given by 


Df(z) 




(9) 


and use gradient step (note that D/(z) = V^/(z)) 

s = z^^^ — 7 VF(z^^^) 

^(f+l) = s pT I pj-ox (PiS) — PiS 


( 10 ) 

( 11 ) 


Simple forward-backward can't be expected to work well for all moderate 
to high-dimensional problems, and especially not for functions with more 
extreme non-linearities. Given the composite nature of P(z), the sensitivity 
and condition of this problem could vary drastically, so one will have to tailor 
their approach to account for this. 

2 . given 

From (7), the problem is 

r 1 2 

argmin < ^{w) + - WA^w — {z + u)\\j^^ 



It is easier to see the relationships between terms when operating on a single 
layer, and since this sub-problem is separable by layer, we proceed condi¬ 
tionally on 1. Let fi I := //(z^y+i) then 


Wi 


N 


argmin<1 (p{^i) + ]2^ (fu^^Ni)wi- (z,y + u^y) 


i=l 


= prox (a^^ (/z(Z/+i) Diag(p/) (g) I^,) (z/ + ui) 


= prox<^ vec((Z, + Ul) Diag(p/)//(Z;T i)) 


( 12 ) 


14 








where 


® In,'^ 


N 


N 



— fi(^i+i) Diag(^/)/;(Z/+i)'^ (g) 

and is a right pseudo-inverse. See Appendix A for details. 


The resulting proximal problem involves a quadratic in Wi that is no longer 
strictly diagonal in its squared term and, thus, isn't necessarily a simple 
proximal operator to evaluate. The operator splitting that underlies ADMM, 
Douglas-Rachford and similar techniques is a common approach to this type 
of problem. Also, if full decompositions (e.g. SVD, eigen) of A^^ are reason¬ 
able to compute at each iteration, one could proceed by working in trans¬ 
formed W] coordinates; however, the transform will induce a dependency 
between components of Wi in (p, which may require proximal solutions that 
are themselves difficult to compute. The composite methods in Argyriou 
et al. [2013], Chen et al. [2013] are designed for such transformed problems, 
at least when the transform is positive-definite, but given the dependency on 
fi, such conditions are not easy to guarantee in generality. 

Other approaches for solving this sub-problem are forward-backward ifer- 
ations and ADMM; each approach should be considered in light of /;;. A 
forward-backward approach may be trivial to implement, especially when 
fij has a known bound, but convergence can be prohibitively slow for poorly 
conditioned Au,. Some ADMM approaches can lead to faster convergence 
rates, but at the cost of repeated matrix inversions, which-for structured ma¬ 
trix problems-may be trivial. 

For example, a forward-backward approach for lower semi-continuous, con¬ 
vex (p would consist of the following fixed-point iterations 


ivi = prox (wj — Xzv'^F{wi)) 




where A^; > 0 and 



with = Zi^i + Ui i and 


N 


dwi — ® ^Ni) ^i,h 


i=\ 

N 

Cwi = y 

^ : _ 1 


i=\ 
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If VF(w/) is Lipschitz continuous with constant jzv, then > 7 ^,; otherwise, 
line-search can be used to find a sufficient jw at every iteration. 

An ADMM approach could result in a single primal proximal step in F, 
which involves inversions of a quantity like At + A^;,, and, given the form 
of Aw^, may be possible to compute via the well-known identity 


I + UV 


T 


-1 


uv'^ 

1 + v^u 


3. given 


This involves the standard cumulative error 


update for the augmented Lagrangian, which is 

+ A 




w,(i) 

W ^ 


(13) 


Our approach requires a "consensus" across observations in Step 2, but the 
remaining steps are free to be processed asynchronously in observation dimension 
N. The structure of both Ay^ and essentially determine the separability, since 
they each act as "correlation" matrices between z and w. Observing (5), we see that 
Clw sums "horizontally" across layers, but in independent blocks of observations. 


4.2 Model Selection and Architecture Design 

One advantage of viewing a deep learning predictor as a Bayes MAP estimator is 
that we can seamlessly apply Bayesian model selection tools to determine optimal 
architecture design. We will provide a model selection criterion for choosing be¬ 
tween different link functions, size of the number of units together with the num¬ 
ber of layers. Given the probabilistic data generating model, p(i/|y) and a deep 
learning estimator y{X,yVi) based on L layers, we propose the use of an infor¬ 
mation criteria (IC) to gauge model quality. Such measures like AIC, corrected 
AIC Hurvich and Tsai [1991], BIG and others George and Foster [2000] have a long 
history as model selection criteria. From a Bayesian perspective, we can define 
the Bayes factor as the ratio of marginal likelihoods and also provide an optimal 
model averaging approach to prediction. 

An information measure for a deep learning predictor, y, or equivalently a 
given architecture design, falls into a class of the form 

JC(y(L)) = -2 log p(y|y(A-, Wt) + c • df . 

where df is a measure of complexity or so-called degrees of freedom of a model and c 
is its cost. The degrees of freedom term can be defined as df := o'~^ Cov{yi, yi) 
where is the model estimation error. 

The intuition is simple-the first term assesses in-sample predictive fit. How¬ 
ever, over-fitting is the curse of any nonlinear high dimensional prediction or 
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modeling strategy and the second term penalises for the complexity in architec¬ 
ture design-including the nonlinearities in the links, number of units and layer 
depth. The combination of terms provides a simple metric for the comparison of 
two architecture designs-the best model provides the highest IC value. 

For suitably stable predictors, y, Stein [1981] provides an unbiased estimator of 

risk using the identity df = E dyi/dy^. Given the scalability of our algo¬ 

rithm, the derivative ^y /^y is available using the chain rule for the composition of 
the L layers and computable using standard tensor libraries such as Torch. 

Efron [1983] provides an alternative motivation for model selection by directly 
addressing the trade-off between minimising out-of-sample predictive error and 
in-sample fit. Consider a nonparametric regression under £^-norm. The in-sample 
mean squared error is err = Hy — y|P and the out-of-sample predictive MSE is 
Err = Ey* (| |y* — y| P) for a future observation y*. In expectation we then have 

E (Err) = E (err -|- 2var(y,y)) 

The latter term can be written in terms of df as a covariance. Stein's unbiased risk 
estimate then becomes 


ETr=|ly-#lP + 2.^E|. 

Models with the best predictive MSE are favoured. This approach also provides 
a predictive MSE criterion for optimal hyper-parameter selection in the prior reg- 
ularisation penalty (p(W) and allow the researcher to gauge the overall amount 
of regularisation necessary to provide architectures that provide good predictive 
rules. In contrast, the current state-of-the-art is to use heuristic rules such as dropout, 


5 Applications 

To illustrate our methodology, we provide an illustration of multi-class deep logis¬ 
tic learning. We use logit/softmax activation functions which allows us to employ 
some efficient quadratic envelopes to linearize the functions in the proximal sub¬ 
problems within our algorithm. We use Fisher's Iris data to illustrate the compar¬ 
ison between DL models and traditional classification algorithms such as support 
vector machines. 

5.1 Multi-Class Deep Logistic Regression 

Suppose that we have a multinomial loss with K-classes, where the observations 
are Y G and fi{z) = ( t { z ) are all logistic functions. Now, (11) has an explicit 
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form as 


N f /Ni \ Ni ^ 

L{y,z,) = EI log (^ \ 

= E {log - i^rzu} 

;=1 

= log(lJ/^i)lN - tr(y^Zi) 

= In log (/n ® ij) ) - y^zi 

with Zi G Ni = K, y = vec(y) and Zi = vec(Zi). Equation (11) is now 

zi = argmin| 7 L(y,s) + ^||s-y||5v^J 
where y = Pi (z - 7 Vf (z)) and s G 

Since this function has a Lipschitz bounded derivative with constant 71 = 1/4, we 
can use forward-backward to converge to a solution of the proximal problem. In 
this case, the sub-problem forward-backward steps are 

s= prox (s —— VL(y, s)) 

= prox (y) 

A,,iL(y,-) 

with p = Ni 0 (EiNi 0 Ij^) and 

VL(y,zi) = p-y 
V^L(y,Zi) = Diag(p) - ppT 

for £1 = 7]v 0 ij and 0 signifying element-wise division. 

All of the methods described here require derivative and/or Hessian informa¬ 
tion at some stage in the sub-problem minimization steps, and for the logistic trans¬ 
form those quantities are easily obtained: 

D f{x) = Diag (g(x) © (1 - g(x))) 

H f{x) = Diag (g(x) © (g(x) - 1) © (1 - 2(r{x))) 

where © is the Hadamard/element-wise product. In the fully vectorized form 
of the model given in Section 4.1, these quantities will need to be augmented by 
the structural properties of mappings like /(z). Such operations are perhaps best 
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suited for tensor and graph libraries, especially if they're capable of distributed 
the operations and handling the sparsity inherent to the model's design. We do 
not cover those details here, but our implementations have shown that standard 
sparse matrix libraries can be leveraged to perform such operations without too 
much coding effort. 


% non-zero w 

'Yw 

m 

1.00 

0 

0.10 

0.54 

0.67 

0.10 

0.40 

1.33 

0.10 

0.40 

2 

0.10 

1.00 

0 

1.00 

0.40 

0.67 

1.00 

0.35 

1.33 

1.00 

0.29 

2 

1.00 

1.00 

0 

1.50 

0.33 

0.67 

1.50 

0.29 

1.33 

1.50 

0.28 

2 

1.50 


Table 2: Percentage of non-zero w entries for the final parameter set in the pe¬ 
nalized model with logit activations and multinomial/softmax loss. 


5.2 Iris Data 

Fisher's Iris flower dataset consists of four measurements from 50 observafions 
of three species (K = 3) of Iris. The species are Iris setosa, Iris virginica and Iris 
versicolor and we have 150 fotal observations. The various Iris species have petals 
and sepals (the green sheaths at the base of the petals) of different lengths which 
are measured in cm and dim{X) = 4. The goal is to predict the label given by 
species name as a function of the four characteristics. 

To illustrate our methodology, we use a deep learning architecture with a multi¬ 
nomial (or softmax) loss with a hidden logit layer with a softmax link for fi. The 
hidden layer uses 10 units and so we have L = 2 and N 2 = 10. Section 5.1 provides 
details of the loss structure and objective function. To induce sparsity we use an 
penalization. One purpose of our study is to show how a sparse model can per¬ 
form as well as a dense NN model. The construction of (p{w) includes a sparsity 
parameter 7 ^; so that the penalty term is effectively 

Nu, 

(p{w) := 7a; \wj\ . 

;=i 

By varying 7 h; we can find the full regularisation path for our deep learning model 
much in the same way as a traditional lasso regression. 


19 






Figure 1: Primal and dual objective values for the penalized model with logit 
activations and multinomial/softmax loss. 


The parameters are estimated from 70% of the data, i.e. the training sample, 
leaving a hold-out test sample of 30%. For the same training and testing data, 
we find that the neural net library nnet Venables and Ripley [2002] in R gives a 
test error of 92% on average, which is comparable to the results produced by our 
model under the tested configurations. 

Figure 1 shows the primal and dual objective values across iterations and pa¬ 
rameter values. Figure 2 shows the classification rates. Table 2 lists the percentage 
of non-zero entries in w at the end of each fit across the given parameters. From the 
plot we can see that the penalization has a noticeable effect over the given range 
of jw, and referring back to Figure 2 we see that classification rates comparable to 
the dense w case (i.e. 'yw = 0) are obtained for sparse w. 

6 Discussion 

Deep Learning provides an exciting new area for nonlinear prediction rules in 
many applications in image processing and machine learning. High dimensional 
data reduction is achieved using a set of hidden layer variables. Our estimation 
methodology uses a proximal splitting techniques that leads to efficient imple¬ 
mentation. Our methods apply to non-convex non-differentiable regularisation 
penalties. The full regularisation path is available as is hyper-parameter selection 
via a statistical predictive mean squared error criterion. 
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Figure 2: Training and testing classification rates for the penalized model with 
logit activations and multinomial/softmax loss. 


There are many areas of fufure application. First, there's a range of models such 
as convolution neural nets where tailored ADMM methods can be constructed. 
Adding acceleration methods such as Nesterov [1983] to proximal operator steps 
and understanding the speed of convergence of these algorithms provides another 
ares of fruitful future research. There are also alternative methods for handling the 
general constrained problem in (3). The augmented Lagrangian can be defined in 
terms of a general penalty instead of the customary term. It would be worth¬ 
while to investigate which other penalties work well, if not better, for the problems 
discussed here. As well, there is a possible relation between the dropout technique 
in Deep Learning and certain types of regularization on W; and/or z;. For that 
matter, regularization on z; hasn't been explored in this context, yet it may provide 
an automatic means of exploring the space of splitting designs. 

One fruitful area of research is to develop tailored Markov chain Monte Carlo 
(MCMC) methods to allow one to correctly assess uncertainty bounds and pro¬ 
vide more efficient predictive rules. For example, a multi-class logistic regression 
can be implemented in MCMC using a Polya-Gamma data augmentation Poison 
et al. [2013] scheme. Another advantage of our regularisation approach frame¬ 
work is the application to optimal selection of hyper-parameters defined as the 
overall amount of regularisation applied to the parameter weights. The methods 
in Pereyra [2013] can be used to combine proximal approach with MCMC to obtain 
a full description of the uncertainty in parameter estimation of the weights. 
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There are many other areas of future research. For example, tailoring algo¬ 
rithms for other classes of neural net models such as recurrent neural networks, 
see Graves and others [2012] is as interesting area of study. There is also a large 
literature in statistics on non-parametrically estimating the link functions for shal¬ 
low architecture models. Until now, deep learning pre-specifies the link and it's 
an open problem to see whether one can consistently estimate this in a deep layer 
model. 
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A Block Wi Steps 


In this section we expand the quadratic term in (12). Using the indexed form of (3) 


and 



we obtain 



2 


Letting / = z, / -|- U/y — hi and expanding, we get 

-wjAw^wi - wjdu^i + Cw, 
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for 


N 

Au;, = Hifhiffi ® ^Ni = fi{Zi+i) Diag(}ii)fi(Zi+i)^ ® In, 

i=l 

N 

dw, = E ^ ^N,) ^i,i = {fi{Zi+i) Diag(^j) ® 1^,) 

i=l 

= vec ^ I T T_ \ A-fV vT 


{Zi + Ul) Diag(^/)//(Z;^i)) - fi{Zi+i)jii ® hi 


-I N -I 

i- ^ M_ ,|2 i MX ||2 


Cw, — „ 


Y^HiKuW — oII^/ 


lA 


i=l 




where = Z/ + U/ — (l]v <8) &/). Similarly, we can decompose these terms 


Aw I — H^^Hwi 8) t]V; ~ 8 tw;) 8 tjV;) 


8 In, = (td®; 8 (M; 8 In,) 


where M/ := Diag(y^) G and Hwi '■= Mifi{Zj^^) G R^'+G 
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